feat: let `nw.Enum` accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1) by camriddell · Pull Request #2192 · narwhals-dev/narwhals

camriddell · 2025-03-11T19:44:38Z

What type of PR is this? (check all applicable)

Related issues

Related issue enh: let Enum take arguments, allow it in construction #1541

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

Adds support for the nw.Enum datatype for pandas (backed by pandas.CategoricalDtype(…, ordered=True)).

The current implementation diverges from pandas/Polars in two broad ways:

We do not check for None, NaN, or Null (both pandas and Polars raise when they construct a CategoricalDtype/Enum with these in the categories list).
pandas allows arbitrary (hashable) objects to be stored as the categories, whereas Polars only allows strings. The current implementation is type-hinted to follow suit with pandas, but we do not perform this check, instead letting the backend library raise as needed.

>>> import narwhals as nw
>>> import pandas as pd
>>> s = nw.new_series('foo', ['a', 'b', 'c'], dtype=nw.Enum(['a', 'b', 'c', 'd']), native_namespace=pd)
>>> s
┌───────────────────────────────────────────────┐
|                Narwhals Series                |
|-----------------------------------------------|
|0    a                                         |
|1    b                                         |
|2    c                                         |
|Name: foo, dtype: category                     |
|Categories (4, object): ['a' < 'b' < 'c' < 'd']|
└───────────────────────────────────────────────┘

- add conversion from native to pandas - add conversion from native to Polars

MarcoGorelli · 2025-03-23T20:41:38Z

thanks! it's encouraging that this doesn't break downstream tests

sorry i didn't get round to it for tomorrow's release, will try to get it in for next week's one 👍

MarcoGorelli · 2025-03-29T12:06:28Z

narwhals/_pandas_like/utils.py

+            except ImportError as exc:  # pragma: no cover
+                msg = f"Unable to convert to {dtype} to to the following exception: {exc.msg}"
+                raise ImportError(msg) from exc
+            return pd.CategoricalDtype(categories=dtype.categories, ordered=True)


not sure if we can do something pandas-specific here, as this is used by cudf and modin too - could we generalise?

I'm a little confused by this.
pandas is already a module-level import?

narwhals/narwhals/_pandas_like/utils.py

Line 13 in 5550ad8

import pandas as pd

@dangotbanned you're right - I pulled this code from a pretty old branch I had so that must have just been leftover. I'll delete it.

@MarcoGorelli I'll look into generalizing cudf and modin

MarcoGorelli · 2025-03-29T12:07:42Z

narwhals/_pandas_like/utils.py

+    if dtype == "category":
+        if native_dtype.ordered:
+            return dtypes.Enum(categories=native_dtype.categories)
        return dtypes.Categorical()


this would be a breaking change, so i'm not totally sure about it - could we preserve the current behaviour in v1 and only make this change in the main namespace? the version variable is available in this function, you can use that
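To illustrate the suggestion above, a minimal sketch of gating the mapping on the API version might look like the following. The `Version` enum, the dataclass stand-ins for the narwhals dtypes, and `FakeCategoricalDtype` are all hypothetical simplifications, not the actual narwhals or pandas classes.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum as PyEnum, auto


class Version(PyEnum):
    # Hypothetical stand-in for narwhals' internal version marker.
    V1 = auto()
    MAIN = auto()


@dataclass(frozen=True)
class Categorical:
    """Stand-in for an unordered categorical dtype."""


@dataclass(frozen=True)
class Enum:
    """Stand-in for an Enum dtype carrying fixed, ordered categories."""

    categories: tuple


@dataclass(frozen=True)
class FakeCategoricalDtype:
    """Stand-in for a native categorical dtype (assumption, not pandas)."""

    categories: tuple
    ordered: bool = False


def categorical_to_narwhals(native_dtype: FakeCategoricalDtype, version: Version):
    # In v1, keep the old behaviour: every categorical maps to Categorical.
    if version is Version.V1:
        return Categorical()
    # In the main namespace, ordered categoricals become Enum.
    if native_dtype.ordered:
        return Enum(categories=tuple(native_dtype.categories))
    return Categorical()


ordered = FakeCategoricalDtype(("a", "b", "c"), ordered=True)
print(categorical_to_narwhals(ordered, Version.V1))
print(categorical_to_narwhals(ordered, Version.MAIN))
```

Checking the version first means existing `stable.v1` users see no change, while the main namespace gets the richer dtype.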

chore(typing): Resolve `_polars.utils` dtype ignores

I noticed a new one in (#2192) and thought I'd get them all in one sweep

* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* chore: "coverage". Just replacing the original `getattr`, there was already no coverage for that: https://github.com/narwhals-dev/narwhals/actions/runs/14145863466/job/39633072966?pr=2312

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

This will be fixed in a future modin release modin-project/modin#7487

MarcoGorelli · 2025-04-06T10:29:26Z

thanks Cam - looks like there's an xpass

FAILED tests/series_only/cast_test.py::test_cast_to_enum_v1[modin[pyarrow]]

docs/backcompat.md

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

Also adapted an error message to be closer to https://github.com/pola-rs/polars/blob/319a9a84ab573886b2a13548a8e462fee353acef/py-polars/polars/datatypes/classes.py#L694-L696

dangotbanned

Thanks for the PR @camriddell!

I've left some non-blocking comments/questions.
Looking pretty ready to me 🎉

narwhals/dtypes.py

dangotbanned · 2025-04-17T08:55:07Z

narwhals/_pandas_like/utils.py

+def non_object_native_to_narwhals_dtype(native_dtype: Any, version: Version) -> DType:
+    dtype = str(native_dtype)
+


This change seems to have been there since the first commit (3581985), but doesn't seem to be documented?

It looks like this part is related

https://github.com/camriddell/narwhals/blob/d2504a40efc606d8e626a5b9049ff8054417d64c/narwhals/_pandas_like/utils.py#L320-L321

Which would mean we do the str(...) call twice now. Just an observation, not sure if there is a cost to that

https://github.com/camriddell/narwhals/blob/d2504a40efc606d8e626a5b9049ff8054417d64c/narwhals/_pandas_like/utils.py#L306-L309

Are all non-object pandas data types guaranteed to be immutable?
I think str was used because it is hashable, so is safe to use in functools.lru_cache

Which would mean we do the str(...) call twice now. Just an observation, not sure if there is a cost to that

I think the cost of a repeated call to str should be fairly negligible, we can always come back later to refactor if a profiler disagrees with this statement and this leads to a larger overhead.

Are all non-object pandas data types guaranteed to be immutable?
I think str was used because it is hashable, so is safe to use in functools.lru_cache

Since the tests pass, I am at least confident that all of the datatypes are hashable. However, if that hash is just the default id(self) / 16 rather than something meaningful, caching may not be reliable. That said, perhaps we can leave it as is for now and revisit it if we catch wind of a slowdown in the future? Trying to avoid the premature optimization scenarios here.
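The hashability concern can be shown with a small standard-library sketch. `IdentityDtype` and `ValueDtype` are hypothetical classes, not pandas dtypes. The point is only that `functools.lru_cache` keys on `__hash__` and `__eq__`, so with identity-based hashing two equal-looking dtype objects never share a cache entry.

```python
import functools
from typing import Any


class IdentityDtype:
    """Hypothetical dtype relying on object's default identity-based hash."""

    def __init__(self, name: str) -> None:
        self.name = name


class ValueDtype:
    """Hypothetical dtype with value-based equality and hashing."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ValueDtype) and self.name == other.name

    def __hash__(self) -> int:
        return hash(self.name)


@functools.lru_cache(maxsize=16)
def convert(dtype: Any) -> str:
    # Placeholder for an expensive native -> narwhals dtype conversion.
    return f"converted:{dtype.name}"


# Two distinct instances with identity semantics: the second call misses.
convert(IdentityDtype("category"))
convert(IdentityDtype("category"))
identity_hits = convert.cache_info().hits

# Two distinct instances with value semantics: the second call hits.
convert.cache_clear()
convert(ValueDtype("category"))
convert(ValueDtype("category"))
value_hits = convert.cache_info().hits

print(identity_hits, value_hits)
```

In this model the cache is still correct with identity hashing, it just stops being useful, which matches the "may not be reliable" worry rather than a correctness bug.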

@camriddell agreed on the str part.

My concern on the hashability though is related to #2051 (comment)

Right now we won't get a warning like that because we have:

native_dtype: Any

However - good news!
I changed it to this locally:

@functools.lru_cache(maxsize=16)
def non_object_native_to_narwhals_dtype(
    native_dtype: pd.api.extensions.ExtensionDtype, version: Version
) -> DType:

And followed through to the docs to find:

ExtensionDtypes are required to be hashable. The base class provides

Looks like we're all good 🙂

Great find on that one, thanks so much for diving in there!

dangotbanned · 2025-04-17T09:51:02Z

narwhals/_dask/utils.py

+        if isinstance(dtype, dtypes.Enum):
+            import pandas as pd
+
+            # NOTE: `pandas-stubs.core.dtypes.dtypes.CategoricalDtype.categories` is too narrow
+            # Should be one of the `ListLike*` types
+            # https://github.com/pandas-dev/pandas-stubs/blob/8434bde95460b996323cc8c0fea7b0a8bb00ea26/pandas-stubs/_typing.pyi#L497-L505
+            return pd.CategoricalDtype(dtype.categories, ordered=True)  # pyright: ignore[reportArgumentType]


@MarcoGorelli does this seem like it could be widened upstream (https://github.com/pandas-dev/pandas-stubs)?

I've traced back the runtime check to is_list_like:

https://github.com/pandas-dev/pandas/blob/5fef9793dd23867e7b227a1df7aa60a283f6204e/pandas/_libs/lib.pyx#L1236-L1282

The current annotation only permits list[Any] as a non-pandas Sequence

https://github.com/pandas-dev/pandas-stubs/blob/8434bde95460b996323cc8c0fea7b0a8bb00ea26/pandas-stubs/core/dtypes/dtypes.pyi#L36
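As a rough illustration of the mismatch, the sketch below contrasts a narrow `list[Any]` annotation with a runtime check that accepts any sized, non-string iterable. `is_list_like_sketch` and `make_categories` are hypothetical names, and the check is a simplified stand-in that does not reproduce the full logic of pandas' `is_list_like`.

```python
from __future__ import annotations

from collections.abc import Iterable, Sized
from typing import Any


def is_list_like_sketch(obj: object) -> bool:
    # Simplified assumption: iterable and sized, but not str/bytes.
    return (
        isinstance(obj, Iterable)
        and isinstance(obj, Sized)
        and not isinstance(obj, (str, bytes))
    )


def make_categories(categories: list[Any]) -> tuple[Any, ...]:
    # The annotation says `list`, yet the runtime check is broader.
    if not is_list_like_sketch(categories):
        msg = f"expected list-like, got {type(categories).__name__}"
        raise TypeError(msg)
    return tuple(categories)


# A tuple passes at runtime, although a type checker would flag it
# against the `list[Any]` annotation (hence the ignore comment).
result = make_categories(("a", "b", "c"))  # type: ignore[arg-type]
print(result)
```

Widening the annotation to a sequence-like type would remove the need for the ignore at the call site.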

@camriddell ignore this, I only meant to add as a comment - not the review 🫣

@MarcoGorelli gentle nudge on this, in case it was missed

hey - yeah, probably, the pandas stubs definitely don't get all the attention they probably deserve

MarcoGorelli

This is going to cause issues for people even just inspecting the schema of a dataframe:

In [1]: import narwhals as nw

In [2]: import pandas as pd

In [3]: s = pd.Series([1,2,3], dtype=pd.CategoricalDtype(ordered=True))

In [4]: nw.from_native(s, series_only=True)
Out[4]: 
┌──────────────────────────────────┐
|         Narwhals Series          |
|----------------------------------|
|0    1                            |
|1    2                            |
|2    3                            |
|dtype: category                   |
|Categories (3, int64): [1 < 2 < 3]|
└──────────────────────────────────┘

In [5]: nw.from_native(s, series_only=True).dtype
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 nw.from_native(s, series_only=True).dtype

File ~/polars-api-compat-dev/narwhals/series.py:368, in Series.dtype(self)
    353 @property
    354 def dtype(self: Self) -> DType:
    355     """Get the data type of the Series.
    356 
    357     Returns:
   (...)    366         Int64
    367     """
--> 368     return self._compliant_series.dtype

File ~/polars-api-compat-dev/narwhals/_pandas_like/series.py:236, in PandasLikeSeries.dtype(self)
    232 @property
    233 def dtype(self: Self) -> DType:
    234     native_dtype = self.native.dtype
    235     return (
--> 236         native_to_narwhals_dtype(native_dtype, self._version, self._implementation)
    237         if native_dtype != "object"
    238         else object_native_to_narwhals_dtype(
    239             self.native, self._version, self._implementation
    240         )
    241     )

File ~/polars-api-compat-dev/narwhals/_pandas_like/utils.py:321, in native_to_narwhals_dtype(native_dtype, version, implementation)
    319     return arrow_native_to_narwhals_dtype(native_dtype.pyarrow_dtype, version)
    320 if str_dtype != "object":
--> 321     return non_object_native_to_narwhals_dtype(native_dtype, version)
    322 elif implementation is Implementation.DASK:
    323     # Per conversations with their maintainers, they don't support arbitrary
    324     # objects, so we can just return String.
    325     dtypes = import_dtypes_module(version)

File ~/polars-api-compat-dev/narwhals/_pandas_like/utils.py:260, in non_object_native_to_narwhals_dtype(native_dtype, version)
    258         return dtypes.Categorical()
    259     if native_dtype.ordered:
--> 260         return dtypes.Enum(native_dtype.categories)
    261     return dtypes.Categorical()
    262 if (match_ := PATTERN_PD_DATETIME.match(dtype)) or (
    263     match_ := PATTERN_PA_DATETIME.match(dtype)
    264 ):

File ~/polars-api-compat-dev/narwhals/dtypes.py:464, in Enum.__init__(self, categories)
    462     if not isinstance(cat, str):
    463         msg = f"{type(self).__name__} categories must be strings; found data of type {type(cat).__name__!r}"
--> 464         raise TypeError(msg)
    465     seen.add(cat)
    466 self.categories = sequence

TypeError: Enum categories must be strings; found data of type 'int'

In particular, it would be a breaking change for Altair users, who'd no longer be able to plot pandas dataframes where columns are of categorical dtype and have non-string categories. It's probably not showing up at the moment in the downstream tests because we were careful to use narwhals.stable.v1
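The failure in the traceback can be modelled without pandas using a stripped-down version of the strict constructor. `StrictEnum` below is a hypothetical reconstruction based only on the lines visible in the traceback (the `isinstance(cat, str)` check and the `seen` set); the duplicate check and its message are assumptions.

```python
from __future__ import annotations

from typing import Iterable


class StrictEnum:
    """Hypothetical model of an Enum dtype that only accepts string categories."""

    def __init__(self, categories: Iterable[object]) -> None:
        sequence = tuple(categories)
        seen: set[object] = set()
        for cat in sequence:
            if cat in seen:
                # Assumed behaviour: duplicates are rejected.
                msg = f"{type(self).__name__} categories must be unique; found duplicate {cat!r}"
                raise ValueError(msg)
            if not isinstance(cat, str):
                msg = (
                    f"{type(self).__name__} categories must be strings; "
                    f"found data of type {type(cat).__name__!r}"
                )
                raise TypeError(msg)
            seen.add(cat)
        self.categories = sequence


# Integer categories, as in an ordered categorical built from [1, 2, 3],
# are rejected, so merely reading such a column's dtype would fail.
try:
    StrictEnum([1, 2, 3])
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Because the check lives in `__init__`, it runs on every native-to-narwhals conversion as well as on user construction, which is why schema inspection is affected.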

dangotbanned · 2025-04-17T12:47:30Z

#2192 (review)

This is going to cause issues for people even just inspecting the schema of a dataframe

That's a good point @MarcoGorelli

If someone is currently doing that operation, on v1, it would look like this:

import pandas as pd

from narwhals.stable import v1 as nw_v1

s = pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(ordered=True))
>>> nw_v1.from_native(s, series_only=True).dtype
Categorical

So far we've had two options:

1. Being lax with categories [1, 2]
2. Using the stricter rules from polars [3, 4]

I see two other tweaks we could do to option 2 - when we can't meet the constraints of pl.Enum

1. Just continue mapping pd.CategoricalDtype -> nw.Categorical
   - No change in behavior, breaks no-one
2. Use an alternative constructor for pd.CategoricalDtype -> nw.Enum
   - So we'd still reject nw.Enum([1, 2, 3])
   - But we'd allow existing ordered categoricals to be represented by an ordered type

I think either of those would solve the problem, but I think the simplest is to just keep using nw.Categorical
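The alternative-constructor tweak could be sketched roughly as below: the public constructor keeps the strict string check, while a separate classmethod is used only when converting an existing native dtype. `from_native_categories` is a hypothetical name and this `Enum` class is a simplified stand-in, not the narwhals implementation.

```python
from __future__ import annotations

from typing import Iterable


class Enum:
    """Hypothetical Enum dtype with a strict public constructor."""

    def __init__(self, categories: Iterable[str]) -> None:
        sequence = tuple(categories)
        for cat in sequence:
            if not isinstance(cat, str):
                msg = f"Enum categories must be strings; found {type(cat).__name__!r}"
                raise TypeError(msg)
        self.categories = sequence

    @classmethod
    def from_native_categories(cls, categories: Iterable[object]) -> Enum:
        # Bypass __init__ so categories already accepted by a backend
        # (e.g. integers in an ordered categorical) are kept as-is.
        obj = cls.__new__(cls)
        obj.categories = tuple(categories)
        return obj


relaxed = Enum.from_native_categories([1, 2, 3])

try:
    Enum([1, 2, 3])
    strict_rejected = False
except TypeError:
    strict_rejected = True

print(relaxed.categories, strict_rejected)
```

The trade-off is that the dtype can then hold categories its own constructor would refuse, so round-tripping through `Enum(dtype.categories)` would fail for such values.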

dangotbanned · 2025-04-17T12:47:53Z

`altair`-related

@MarcoGorelli

In particular, it would be a breaking change for Altair users, who'd no longer be able to plot pandas dataframes where columns are of categorical dtype and have non-string categories.
It's probably not showing up at the moment in the downstream tests because we were careful to use narwhals.stable.v1

I could be wrong, but I don't think we have any paths that would hit this - even if we weren't using v1?
AFAICT, pandas is handled natively - since the type conversion logic predates narwhals and (I assume) we didn't wanna make a breaking change.

Impl

Tests

Docs

https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types

MarcoGorelli · 2025-04-17T13:38:53Z

This is the part that would break in Altair:

https://github.com/vega/altair/blob/f1e0049e6f6669ec46ec462cec81ce62aae8cbf2/altair/utils/core.py#L670-L671

It would also affect Plotly and other libraries

I think it's fine to be laxer here - Polars only allows string column names, but we allow pandas dataframe with non-string column names. Similarly, we can allow pandas dataframes with non-string categories

I think it's legit to do something like

s: nw.Series
categories = list(s.dtype.categories)
categories.append(new_value)
nw.new_series('a', values, dtype=nw.Enum(categories))

where the categories are taken from user inputs. If the user is starting with something which a backend permits, they can continue with that, no issues

dangotbanned · 2025-04-17T14:32:36Z

This is the part that would break in Altair:

vega/altair@f1e0049/altair/utils/core.py#L670-L671

Well spotted @MarcoGorelli, I stand corrected 😄

where the categories are taken from user inputs. If the user is starting with something which a backend permits, they can continue with that, no issues

I guess I'm just more in the camp of what @camriddell said in (#2192 (comment))

If we only let backends raise, we will hit an issue where some code only works with specific backends, which reduces the purpose of Narwhals.
With the current Enum targeting the pandas_like and Polars backends, I see this primarily happening in the space where writing code with a pandas backend in mind will break if a user passes in a Polars DataFrame because the Enum(…) had non-string categories.

It just seems to me like we're introducing a footgun by deviating from how polars interprets the same situation:

import pandas as pd
import polars as pl

# NOTE: Strings
>>> pl.Series(pd.Series(["1", "2", "3"], dtype=pd.CategoricalDtype(ordered=True))).to_pandas()
0    1
1    2
2    3
Name: , dtype: category
Categories (3, object): ['1', '2', '3']

# NOTE: Not strings
>>> pl.Series(pd.Series([1, 2, 3], dtype=pd.CategoricalDtype(ordered=True))).to_pandas()
0    1
1    2
2    3
Name: , dtype: int64

Important

Happy to follow your lead on this @MarcoGorelli, just wanna make sure I've raised my concerns 🙂

MarcoGorelli · 2025-04-17T18:43:56Z

sure, thanks for explaining

true, there is a risk that someone writes something which doesn't end up working for polars, but i'd rather accept that risk than disallow people from passing valid pandas dataframes to narwhals

i think one reason for narwhals' relatively rapid growth has been that, relative to similar/competing projects, we've put a lot of emphasis on there not being any cost to existing pandas users

camriddell · 2025-04-18T17:56:46Z

@MarcoGorelli I believe the current version meets the changes you requested? When you have a chance can you take another look?

MarcoGorelli

thanks, looks good to me!

@dangotbanned any objections?

dangotbanned · 2025-04-18T19:13:36Z

thanks, looks good to me!

@dangotbanned any objections?

@MarcoGorelli just wanna double check this was what you asked for?

remove enum duplication/null checks

I thought in (#2192 (comment)) you just wanted to allow non-strings - not allow duplicates and None.

But no objections from me

MarcoGorelli · 2025-04-18T19:52:44Z

thanks!

the backends themselves already disallow duplicates and nulls, so tbh i'm not super-bothered, especially given that people will usually be inspecting schemas of dataframes containing enums rather than making new ones

camriddell added 3 commits March 11, 2025 12:34

enh nw.Enum to accept categories

3581985

- add conversion from native to pandas - add conversion from native to Polars

add tests for nw.Enum(categories)

69e9d88

fix enum type checking for Enum dtype

9e63e68

camriddell changed the title ~~Feat:~~ Feat: nw.Enum support for pandas Mar 11, 2025

camriddell added 3 commits March 11, 2025 12:58

Merge branch 'main' of github.com:narwhals-dev/narwhals into enh-enum…

3465fbf

…-creation

fix enum doctest

a570751

positive check for enum instance for pyright

6f22771

camriddell requested a review from MarcoGorelli March 12, 2025 16:03

Merge branch 'main' into enh-enum-creation

04b2ee6

camriddell requested a review from FBruzzesi March 13, 2025 18:16

Merge branch 'main' of github.com:narwhals-dev/narwhals into enh-enum…

1393ac7

…-creation

MarcoGorelli reviewed Mar 29, 2025

dangotbanned added a commit that referenced this pull request Mar 29, 2025

chore(typing): Resolve _polars.utils dtype ignores

a17706d

I noticed a new one in (#2192) and thought I'd get them all in one sweep

dangotbanned mentioned this pull request Mar 29, 2025

chore(typing): Resolve _polars.utils dtype ignores #2312

Merged

10 tasks

camriddell added 3 commits March 31, 2025 09:16

enum use implementation specific CategoricalDtype

6e85c1a

enum preserve v1 behavior

70d5c67

preserve v1 polars enum conversion

e4f2d87

MarcoGorelli added the enhancement New feature or request label Apr 4, 2025

camriddell added 9 commits April 4, 2025 07:23

add Enum support to dask

86eb3a2

modin to xfail on Enum dtype

3e1db9e

This will be fixed in a future modin release modin-project/modin#7487

Merge remote-tracking branch 'upstream/main' into enh-enum-creation

5a996b1

Enum support outside of V1

1ce56ca

Fix v1 enum missing argument teset

a8f6e42

fix enum error match for py38

342ea83

add pragma: no cover to v1.Enum from aligned with DType class

5dd7c72

parametrize api versions for dtypes tests

1bf98d0

decouple narwhals versioned dtypes

da1a455

Merge branch 'main' of github.com:narwhals-dev/narwhals into enh-enum…

7aca1b8

…-creation

camriddell requested a review from dangotbanned April 16, 2025 20:45

Merge branch 'main' into enh-enum-creation

eb0b328

dangotbanned reviewed Apr 16, 2025

docs/backcompat.md (outdated)

Update docs/backcompat.md

d2504a4

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

dangotbanned self-requested a review April 17, 2025 08:07

dangotbanned added 2 commits April 17, 2025 10:13

fix(typing): Resolve some Enum.categories issues

def83a3

refactor: Simplify __init__, raise earlier

f89e331

Also adapted an error message to be closer to https://github.com/pola-rs/polars/blob/319a9a84ab573886b2a13548a8e462fee353acef/py-polars/polars/datatypes/classes.py#L694-L696

dangotbanned reviewed Apr 17, 2025

MarcoGorelli requested changes Apr 17, 2025

camriddell added 2 commits April 18, 2025 09:44

remove enum duplication/null checks

3902a29

Merge branch 'main' of github.com:narwhals-dev/narwhals into enh-enum…

1e3326f

…-creation

camriddell force-pushed the enh-enum-creation branch from fd5ac63 to 1e3326f Compare April 18, 2025 16:45

MarcoGorelli approved these changes Apr 18, 2025

dangotbanned approved these changes Apr 18, 2025

dangotbanned changed the title ~~Feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1)~~ feat: let nw.Enum accept categories, map pandas ordered categorical to Enum (only in main namespace, not stable.v1) Apr 18, 2025

dangotbanned merged commit 0eff60b into narwhals-dev:main Apr 18, 2025
30 checks passed

dangotbanned linked an issue Apr 18, 2025 that may be closed by this pull request

enh: let Enum take arguments, allow it in construction #1541

Closed

This was referenced Apr 18, 2025

enh: let Enum take arguments, allow it in construction #1541

Closed

chore: Add Version.dtypes, remove import_dtypes_module #2431

Merged

dangotbanned mentioned this pull request Dec 1, 2025

[Bug]: Enum accepts non-string categories #3338

Closed

dangotbanned mentioned this pull request Jan 23, 2026

feat: Improve support for Decimal DType #3377

Merged

10 tasks
